Identification of MIR-Flickr Near-duplicate Images - A Benchmark Collection for Near-duplicate Detection

Authors

  • Richard C. H. Connor
  • Stewart MacKenzie-Leigh
  • Franco Alberto Cardillo
  • Robert Moss
Abstract

There are many contexts where the automated detection of near-duplicate images is important, for example the detection of copyright infringement or images of child abuse. There are many published methods for the detection of similar and near-duplicate images; however, it is still uncommon for methods to be objectively compared with each other, probably because of a lack of any good framework in which to do so. Published sets of near-duplicate images exist, but are typically small, specialist, or generated. Here, we give a new test set based on a large, serendipitously selected collection of high-quality images. Having observed that the MIRFlickr 1M image set contains a significant number of near-duplicate images, we have discovered the majority of these. We disclose a set of 1,958 near-duplicate clusters from within the set, and show that this is very likely to contain almost all of the near-duplicate pairs that exist. The main contribution of this publication is the identification of these images, which may then be used by other authors to make comparisons as they see fit. In particular, however, near-duplicate classification functions may now be accurately tested for sensitivity and specificity over a general collection of images.
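
As a rough illustration of how a ground-truth set of near-duplicate clusters could be used to score a classification function, the sketch below computes pairwise sensitivity and specificity. It is not the authors' evaluation procedure: the function names, the representation of clusters as sets of image ids, and the choice to count every unordered image pair as one test case are all assumptions made for this example.

```python
from itertools import combinations


def pairs_from_clusters(clusters):
    """Expand each cluster into the unordered image-id pairs it implies."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def sensitivity_specificity(clusters, predicted_pairs, n_images):
    """Score predicted near-duplicate pairs against ground-truth clusters.

    Assumption: every unordered pair among n_images images is one test case,
    and a pair is a true near-duplicate only if both ids share a cluster.
    """
    truth = pairs_from_clusters(clusters)
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    total_pairs = n_images * (n_images - 1) // 2

    tp = len(truth & predicted)      # near-duplicates that were found
    fn = len(truth - predicted)      # near-duplicates that were missed
    fp = len(predicted - truth)      # pairs wrongly flagged
    tn = total_pairs - tp - fn - fp  # remaining pairs, correctly not flagged

    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical ids, not taken from MIRFlickr).
clusters = [{1, 2, 3}, {4, 5}]
predicted = [(2, 1), (1, 3), (4, 5), (5, 6)]
sens, spec = sensitivity_specificity(clusters, predicted, n_images=6)
```

In this toy case three of the four true pairs are recovered and one of the eleven non-duplicate pairs is wrongly flagged. The sketch does not guard against an empty ground truth, which would make the sensitivity undefined.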

Related resources

Quantifying the Specificity of Near-duplicate Image Classification Functions

There are many published methods for detecting similar and near-duplicate images. Here, we consider their use in the context of unsupervised near-duplicate detection, where the task is to find a (relatively small) near-duplicate intersection of two large candidate sets. Such scenarios are of particular importance in forensic near-duplicate detection. The essential properties of such a function...

A Near-duplicate Detection Algorithm to Facilitate Document Clustering

Web mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near-duplicates is very difficult in a large collection of data such as the Internet. The presence of these web pages plays an important role in performance degradation when integrating data from heterogeneous sources. These pages either increase the index storage space or increase the serving costs. Detecting t...

New Issues in Near-duplicate Detection

Near-duplicate detection is the task of identifying documents with almost identical content. The respective algorithms are based on fingerprinting; they have attracted considerable attention due to their practical significance for Web retrieval systems, plagiarism analysis, corporate storage maintenance, or social collaboration and interaction in the World Wide Web. Our paper presents both an i...

Performance of near-duplicate detection algorithms for Crawljax

On the web, near-duplicate documents are abundant. As many as 40% of the pages on the Web are near-duplicates of other pages, according to Manning et al. [10]. A web crawler should be able to recognize and deal with near-duplicate web pages. In this survey we will first explore the most prominent duplicate-detection algorithms, which could be viable implementations in Crawljax...

Query Based Duplicate Data Detection on WWW

The problem of finding relevant documents has become much more prominent due to the presence of duplicate data on the WWW. This redundancy in results increases the users’ seek time to find the desired information within the search results, while in general most users just want to cull through tens of result pages to find new/different results. The identification of similar or near-duplicate pai...

Publication date: 2015